Predictive modeling of Persian walnut (Juglans regia L.) in vitro proliferation media using machine learning approaches: a comparative study of ANN, KNN and GEP models

Background Optimizing plant tissue culture media is a complicated process that is influenced by genotype, mineral nutrients, plant growth regulators (PGRs), vitamins and other factors, and it often leads to undesirable and inefficient medium compositions. Given the incidence of physiological disorders such as callusing, shoot tip necrosis (STN) and vitrification (Vit) during walnut proliferation, prediction models are needed to identify the impact of the different factors involved in this process. In the present study, three machine learning (ML) approaches, multi-layer perceptron neural network (MLPNN), k-nearest neighbors (KNN) and gene expression programming (GEP), were implemented and compared to multiple linear regression (MLR) to develop models for predicting in vitro proliferation of Persian walnut (Juglans regia L.). The accuracy of the developed models was evaluated using the coefficient of determination (R²), root mean square error (RMSE) and mean absolute error (MAE). To optimize the selected prediction models, a multi-objective evolutionary optimization algorithm based on the particle swarm optimization (PSO) technique was applied. Results All three ML techniques predicted more accurately than MLR; for example, the R² values of MLPNN, KNN and GEP vs. MLR were 0.695, 0.672 and 0.802 vs. 0.412 in Chandler and 0.358, 0.377 and 0.428 vs. 0.178 in Rayen, respectively. The GEP models were then selected for optimization using PSO. The comparison of modeling procedures provides new insight into prediction models for in vitro culture medium composition. Based on the results, the hybrid GEP-PSO technique performs well for modeling walnut tissue culture media, while MLPNN and KNN also showed strong estimation capability.
Conclusion Here, besides MLPNN and GEP, KNN is introduced for the first time as a simple and highly accurate technique for developing prediction models in studies that optimize plant tissue culture media composition. The choice of modeling technique therefore depends on the researcher's priorities: simplicity of the procedure, obtaining clear results as a complete formula, and/or a shorter analysis time.

Background The walnut is one of the most important nuts in the world. Persian walnut (Juglans regia L.) is the only edible walnut species that is widely grown for its nuts and timber [1]. In general, walnut trees are still propagated mainly from seed rather than vegetatively, which results in non-uniform nut quality and irregular yield [2]. In vitro propagation is therefore used to overcome these problems. However, walnuts are considered recalcitrant to in vitro culture, which makes mass propagation of different genotypes difficult, even though several micropropagation protocols have been published for different genotypes [3][4][5][6][7][8][9][10][11]. Walnut micropropagation results have been shown to depend strongly on genotype [7, 9-11]. In addition to genotype, the formulation of the culture medium has a great impact on all micropropagation stages. Up to now, the Driver and Kuniyuki [3] walnut (DKW) culture medium has been the most widely used formulation for walnut tissue culture. Nevertheless, some studies report improved results using modified DKW or other formulations [6][7][8][12][13][14][15].
However, to the best of our knowledge, no comprehensive study has been done on the balance of culture media components (mineral nutrients, plant growth regulators (PGRs) and vitamins) and their interaction together and with genotype on walnut in vitro performance to increase the efficiency of the micropropagation process by enhancing proliferation rate and reducing physiological disorders.
Predicting the effects of the interaction of mineral nutrients, PGRs, vitamins and genotype on explant in vitro performance involves modeling a very complex database, which is problematic and time-consuming with classic statistical analyses and requires accurate, advanced modeling procedures [16,17]. Machine learning (ML) tools allow researchers to understand the studied process and make appropriate decisions when developing optimal culture media [17]. In recent years, different ML models such as neural networks [18][19][20][21][22][23] have been successfully applied to the prediction and optimization of different plant tissue culture processes. In our previous studies, we described hybrid ML techniques that combine artificial neural networks (ANN) with a genetic algorithm (ANN-GA) in Pyrus [24] and Prunus rootstocks [25][26][27], and gene expression programming (GEP) with GA (GEP-GA) [20] and with particle swarm optimization (PSO) (GEP-PSO) [28] in Pyrus rootstocks. These are powerful data mining approaches that allow complicated databases to be modeled and the factors influencing a given response in the micropropagation process to be identified.
ANNs are inspired by the functions of the human brain [29]. ANNs [multi-layer perceptron neural network (MLPNN) and radial basis function neural network (RBFNN)] have been used with considerable success in complex plant tissue culture systems [20, 24-27]. One advantage of ANN is that it requires no prior knowledge of the structure of, or interrelationships between, input and output signals [16]. Other reported uses of ANN include predicting plant biomass [30], clustering micropropagated plantlets, and influencing the growth and quality of regenerated plants by controlling light, ventilation, CO₂ and air temperature inside the culture containers [16].
GEP is another ML-based optimization technique, presented by [31], that combines useful traits of both genetic programming (GP) and GA. This model, which is based on an algorithm that evolves computer programs, was used in our previous studies on Pyrus rootstock micropropagation, where it precisely detected nonlinear and complicated relationships between inputs and outputs [20,24].
Here, ANN and GEP are compared to the k-nearest neighbors (KNN) method, one of the simplest machine learning techniques. The KNN technique identifies the elements among the training samples that most closely match the "current" conditions according to some predefined attributes: the neighbors. The predicted value is then determined from the values associated with these neighbors [32]. Compared with mathematical modeling, the KNN method involves no model development or validation and can therefore be used without recombining data, in contrast to common data-based models [33]. Despite these potential advantages, no research has yet been done on the use of this technique in plant micropropagation.
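The neighbor-based prediction described above can be illustrated with a minimal sketch. It assumes Euclidean distance and an unweighted mean of the k nearest responses; the function name `knn_predict` and the toy data are hypothetical and are not taken from the study.

```python
import math

def knn_predict(train_x, train_y, query, k=3):
    """Predict a response as the mean response of the k nearest training samples."""
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of training samples")
    # Euclidean distance from the query point to every training sample
    distances = [(math.dist(x, query), y) for x, y in zip(train_x, train_y)]
    distances.sort(key=lambda pair: pair[0])
    nearest = distances[:k]
    return sum(y for _, y in nearest) / k

# Hypothetical toy data: two scaled medium components -> one growth response
train_x = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
train_y = [1.0, 2.0, 2.0, 4.0, 3.0]
prediction = knn_predict(train_x, train_y, query=(0.9, 0.9), k=2)
print(prediction)
```

In practice the inputs would normally be scaled to comparable ranges before distances are computed, and k would be tuned per output.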
In our previous study [20], we compared RBFNN and GEP for optimizing the in vitro culture media composition for pear rootstocks. GEP was significantly more powerful and more precise than RBFNN in predicting the quantity and quality of in vitro proliferation, so a GA was applied to optimize the GEP models [20]. However, the GA optimized the input levels required for each specific output separately. Consequently, in our recent study [28], in order to obtain a complete optimum formulation for the culture medium, we compared two algorithms, GEP and the M5' model tree, for predicting the impacts of media minerals and PGRs on in vitro proliferation of pear rootstocks. GEP showed higher prediction precision than the M5' model tree. We therefore optimized the GEP prediction models using multi-objective evolutionary optimization algorithms (MOEAs), including GA and PSO, and compared them with the mono-objective GA optimization procedure. The PSO-optimized GEP prediction models gave the best outputs in both rootstocks [28].
With MOEAs, the task is treated as a multi-objective optimization problem (MOP), and the solutions specify the best possible balance between two conflicting functions [34]. Several mathematical methods have been used to solve MOPs; however, real-world MOPs are typically nonlinear and occasionally non-differentiable [35]. This has increased interest in metaheuristic methods, among which MOEAs are of special interest. Here, PSO, as an evolutionary computation technique, was used to determine optimized culture media.
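The idea of a "best possible balance" between conflicting objectives is usually formalized as Pareto dominance. The sketch below is only an illustration under that assumption: all objectives are written as quantities to minimize (so a response to be maximized, such as PR, enters with a negative sign), and the candidate values are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

# Hypothetical candidates as (-PR, CW): both values are to be minimized
candidates = [(-3.0, 0.8), (-2.5, 0.4), (-2.0, 0.9), (-1.5, 0.2), (-3.0, 1.0)]
front = pareto_front(candidates)
print(front)
```

A multi-objective optimizer returns a set of such non-dominated trade-offs rather than a single optimum, and one formulation is then chosen from that set.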
The aim of this study is to employ three soft computing methods, namely MLPNN, GEP and KNN, to compare the accuracy of their predictions with that of the multiple linear regression (MLR) technique, and to apply the PSO algorithm, with the goal of predicting and optimizing walnut tissue culture media. Briefly, the new contributions of the present research are:
• Comparing the suitability of the MLPNN, KNN and GEP nonlinear methods for modeling the impacts of mineral nutrients, PGRs and vitamins on in vitro culture of walnut.
• Constructing hybrid models to assess how Chandler and Rayen explants respond to the culture medium composition, based on the newly produced shoots obtained from the Taguchi design.
• Finding the optimal composition of culture media that maximizes the proliferation rate (PR) and minimizes callus weight (CW), shoot tip necrosis (STN) and vitrification (Vit) by optimizing the developed model using PSO.
To our knowledge, this study is the first application of MLR, KNN, ANN, GEP and PSO methods for optimizing walnut tissue culture media. In addition, this work is the first use of KNN modeling procedure in plant tissue culture.

Results
Our models of the effects of the inputs (nutrients, PGRs and vitamins) on the outputs (PR, CW, STN and Vit) were developed using the MLR, MLPNN, KNN and GEP techniques. Here, we assess the performance of the developed models by evaluating how precisely each modeling method predicts the composition of plant micropropagation media for walnut. After that, the PSO optimization results for the selected modeling method are examined to find the most efficient media compositions for each trait considered. An outline of the techniques used here to arrive at the most appropriate model is shown in Fig. 1.

Comparison of modeling techniques performances
The mathematical equations obtained with the GEP method, which give the best estimates of the explant growth parameters, are shown in Table 1, and the performance statistics of the models are given in Table 2.
Comparing the observed and predicted values of the outputs shows how well the developed models perform for the studied inputs. A fitting technique with a high squared correlation coefficient was used to produce plots from the constructed models, showing how each of the four outputs varied as the concentrations of media components changed. … (Fig. 7), 0.435, 0.976, 0.975 and 0.853 for Vit of Chandler (Fig. 8); and 0.300, 0.979, 0.978 and 0.891 for Vit of Rayen (Fig. 9), respectively. Therefore, the ML models predicted the outputs accurately, whereas the MLR models could not describe the wide variation in growth parameters, because interactions among the studied variables may hide the effects of individual media components. Figures 2, 3, 4, 5, 6, 7, 8 and 9 may be helpful for understanding the complete relationship between media components and responses and for assessing the combined effects of modifying the DKW medium components.
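The comparison of observed and predicted values can be summarized with the three statistics named in this study (R², RMSE and MAE). The sketch below uses their common textbook definitions, with R² computed as one minus the ratio of residual to total sum of squares; this is an assumption, since the exact formulas used by the authors are not given in this section, and the data are hypothetical.

```python
import math

def regression_metrics(observed, predicted):
    """Return R2, RMSE and MAE for paired observed and predicted values."""
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))  # residual sum of squares
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)              # total sum of squares
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(o - p) for o, p in zip(observed, predicted)) / n,
    }

# Hypothetical observed vs. predicted values of one output
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]
metrics = regression_metrics(observed, predicted)
print(metrics)
```

A higher R² together with lower RMSE and MAE indicates closer agreement between a model and the observations, which is the basis on which the four modeling methods are ranked.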
According to the results presented in Table 2 and Figs. 2, 3, 4, 5, 6, 7, 8 and 9, the MLPNN, KNN and GEP models accurately predicted the effect of media components on the in vitro performance of Persian walnut. To select one of these ML techniques for optimization and obtain final models for in vitro proliferation of Persian walnut, we therefore considered how easily the end user can apply the model. Although MLPNN and KNN performed relatively well, neither offers an explicit mathematical expression. Unlike MLPNN and KNN, which produce black-box models, GEP generates explicit mathematical equations relating the independent variables to the dependent variable. These equations can be optimized directly (to find optimal values of the variables) and can be used in the pre-test stage (the initial phase of a study) when designing and developing experiments. Hence, we selected the GEP models for optimization to obtain proliferation media formulations for Chandler and Rayen.

Optimization of GEP models
Consequently, to obtain the optimized medium giving the highest PR and the lowest CW, STN and Vit in walnut, we optimized the developed GEP models using the multi-objective PSO technique.
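To make the optimization step more concrete, the following is a minimal particle swarm sketch. It is not the multi-objective PSO used in the study: for brevity it collapses two hypothetical surrogate responses (`toy_pr` to maximize, `toy_cw` to minimize) into one scalar cost, and the inertia and acceleration coefficients are common illustrative defaults rather than the authors' settings.

```python
import random

def pso_minimize(objective, bounds, n_particles=20, n_iter=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize objective(x) inside box bounds with a basic global-best PSO."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                      # best position seen by each particle
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]     # best position seen by the swarm
    for _ in range(n_iter):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Move the particle and clamp it to the allowed concentration range
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

# Hypothetical surrogate models standing in for fitted equations
def toy_pr(x):
    return 4.0 - (x[0] - 1.0) ** 2 - (x[1] - 2.0) ** 2

def toy_cw(x):
    return 0.5 * (x[0] - 1.0) ** 2

def cost(x):
    return -toy_pr(x) + toy_cw(x)   # maximize PR, minimize CW

best_x, best_cost = pso_minimize(cost, bounds=[(0.0, 3.0), (0.0, 3.0)])
print(best_x, best_cost)
```

Because GEP yields explicit equations, they can be plugged in as the objective in this kind of search; a true multi-objective variant would keep an archive of non-dominated solutions instead of a single global best.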
The optimized amounts of the studied factors and the values of growth parameters predicted by the GEP models are shown in Table 3.
Table 1 Constructed models using gene expression programming to predict explant growth traits in Persian walnut. X1-X9 are input factors as presented in Table 3.

Discussion
Walnuts, as important woody plants, are considered recalcitrant to in vitro culture, in which genetic determinism, along with other factors such as media components, complicates the different stages of micropropagation. In the present study, three different ML modeling approaches, together with the PSO optimization algorithm, were applied to determine and predict the effects of genotype and media formulation during walnut proliferation. Walnut micropropagation can be improved by including different physiological disorders in the modeling and optimization processes. The incidence of physiological disorders during walnut micropropagation has not been comprehensively investigated. Studies on walnut tissue culture have focused on adding chemicals such as phloroglucinol and FeEDDHA to DKW or Murashige and Skoog [36] (MS) basal media [11,15], supplementing media with various concentrations of different PGRs [37][38][39], removing agar [39], and ventilation and reducing sucrose concentration [40], but only a few studies have addressed the interaction of media components, including mineral nutrients [9,41], vitamins and PGRs [39], on proliferation quality and quantity.
Here, we concentrated on increasing PR and reducing important abnormalities occurring during this phase by recording data from several designed experiments. The resulting database, which includes a range of concentrations of each culture medium component, allows the impacts of all minerals, vitamins and PGRs used in the media, as well as genotype, on the explant growth indices to be evaluated simultaneously, which is feasible only with ML tools.
Machine learning as a powerful tool has been effectively applied in plant biology studies [42,43] including plant tissue culture data analysis and accurate prediction of optimal in vitro culture media composition [20,[24][25][26][27][28]. The development of in vitro plant tissues is controlled by minerals, vitamins and PGRs in the culture media. To achieve maximum explant performance, the prediction of the most efficient media composition is highly useful since the optimization of the type and concentration of minerals, vitamins and PGRs in media is a time-consuming, expensive and laborious job [9,41].
In our previous studies, we successfully constructed neural models using the ANN technique to study the effects of different combinations of minerals and PGRs on in vitro proliferation and rooting of the G × N15 Prunus rootstock [25][26][27]. Our study comparing ANN with MLR modeling to forecast the optimum concentrations of macronutrients in in vitro media for the OHF 69 and Pyrodwarf Pyrus rootstocks showed ANN to be a precise and promising technique [24]. An important benefit of ANN-based methods is that they do not require a suitable fitting function to be identified in advance; consequently, they have a general approximation ability and can, in practice, represent all kinds of nonlinear functions. This trait may help the modeler to develop the most precise model possible. Although ANN is a good alternative to MLR, it does not provide any equations describing the relationships between input and output variables. Moreover, the ANN technique requires a time-consuming process of trial and error to find network parameters such as the number of neurons and hidden layers [44][45][46]. ANNs as the most extensively used ML model, can efficiently solve  [47]. Nevertheless, the "black box" nature of ANN also has drawbacks [48]. In general, ANN cannot explain its logical process, and this constraint makes ANN less suitable for natural science studies, as it can only simulate the change process from experimental data without helping us to understand the reason for the change.
Considering these restrictions of ANN models, in another study on in vitro proliferation of Pyrus rootstocks [20] we compared the power of the GEP technique with that of ANN (RBFNN) and MLR in predicting the optimal media. RBFNN and GEP were more precise than MLR, and GEP gave the most precise model as well as being practical [20]. In our recent research [28], we used two algorithms, GEP and the M5' model tree, to overcome the weaknesses of the ANN method and to simplify the prediction of the effects of media component interactions on in vitro proliferation of Pyrus rootstocks. Again, we found GEP to be more accurate than the M5' model tree [28].
Consequently, in the present study, we applied GEP, the most precise modeling procedure found in [20,28]; MLPNN, an ANN technique whose models give precise predictions more easily than RBFNN when input data are randomly distributed [49]; and KNN, one of the simplest machine learning approaches, which can also be used for regression problems [50]. MLR was also applied as a linear modeling method for comparison with these ML procedures in predicting the optimum in vitro proliferation media composition of walnut. The accuracy of the developed prediction models was evaluated using the MAE, RMSE and R² statistics and the correlation coefficient between observed and predicted values of each output. To our knowledge, KNN algorithms have never been applied to predict plant tissue culture media composition. An advantage of the KNN algorithm is that it requires no specific assumptions about the distribution of the predictors. In KNN, a sample is assigned a value according to the mean response of its k nearest neighbors in the predictor space [51]. Each training example is described by n traits and therefore represents a point in an n-dimensional space, so all training examples are stored in an n-dimensional pattern space. The number of neighbors (k) leading to the best results for each model is presented in Table 3. A key advantage of GP-based procedures such as GEP over other methods is that they need no prior assumption about the form of the relationship in order to produce prediction equations. GP and its variants have been applied in many studies to find complicated relationships that fit different experimental data [52][53][54]. The technique starts from a population of individuals; better individuals are then selected according to a fitness function, and genetic variation is introduced by genetic operators.
Machine learning approaches, including GEP, are programmed to learn the relationships among variables in data sets. GEP differs from its precursors GA and GP in how individuals are encoded: in GEP, individuals are encoded as chromosomes, i.e. fixed-length linear strings, which are finally expressed as simple diagrams called expression trees. In GA, by contrast, individuals are fixed-length linear strings (chromosomes) only, and in GP they are nonlinear entities of different sizes and shapes (parse trees). One strength of GEP over GA and GP is that genetic operators work very simply at the chromosome level, which facilitates the creation of genetic diversity. The unique multigenic nature of GEP is another important feature, allowing more complicated programs composed of multiple sub-programs to be evolved. GEP combines the advantages of both GA and GP while overcoming some of their limitations [55].
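The translation from a fixed-length linear chromosome to an expression tree can be sketched as follows. This is a minimal illustration of the level-by-level (breadth-first) decoding commonly described for GEP genes, not the software used in the study; the function set, the symbol `Q` for square root, the example gene and the variable values are all assumptions chosen for the example.

```python
import math
from collections import deque

# Number of arguments of each function symbol; any other symbol is a terminal
ARITY = {"+": 2, "-": 2, "*": 2, "Q": 1}

def decode_gene(gene):
    """Build an expression tree (nested lists) by filling the tree level by level."""
    symbols = iter(gene)
    root = [next(symbols)]
    pending = deque([root])
    while pending:
        node = pending.popleft()
        for _ in range(ARITY.get(node[0], 0)):
            child = [next(symbols)]   # a valid head/tail gene never runs out of symbols
            node.append(child)
            pending.append(child)
    return root

def evaluate(node, values):
    """Evaluate an expression tree for the given terminal values."""
    symbol = node[0]
    args = [evaluate(child, values) for child in node[1:]]
    if symbol == "+":
        return args[0] + args[1]
    if symbol == "-":
        return args[0] - args[1]
    if symbol == "*":
        return args[0] * args[1]
    if symbol == "Q":
        return math.sqrt(args[0])
    return values[symbol]

# Head "Q*+-" (4 symbols) and tail "abcda" (5 terminals); the last symbol is not expressed
gene = "Q*+-abcda"
tree = decode_gene(gene)
result = evaluate(tree, {"a": 1.0, "b": 3.0, "c": 5.0, "d": 1.0})
print(tree, result)
```

The gene above encodes sqrt((a + b) * (c - d)). Because the tail contains only terminals, any mutation or recombination of the string still decodes to a valid expression, which is why genetic operators can act so simply at the chromosome level.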
Based on the results presented in Table 2, the KNN, MLPNN and GEP models were much more accurate than MLR. In most cases, the MLPNN method provided a better fit than KNN and GEP. However, according to our earlier studies [20,28], the optimized GEP method provides a better fit than other approaches. Furthermore, GEP is preferred over ANN models because ANN is a black-box model, whereas GEP expresses the constructed prediction models as mathematical equations [54].
In recent years, GEP has been applied extensively in other areas because of its high efficiency and effectiveness, and its range of applications is wide and growing rapidly [55]. GEP is one of the most effective function mining algorithms and has been widely used in classification, pattern recognition, prediction and other research areas. The algorithm can mine a suitable function to deal with more complex tasks [56]. GEP has been used to determine water quality and stress in lakes and rivers resulting from wastewater pollutants [31]. The problem of missing values in a data set due to measurement conditions can be handled simply by employing GEP [31]. Results based on actual data sets confirmed that a combined multiple GEP and fuzzy expert system outperforms other detection methods in the medical field by attaining high prediction precision [57].
Our previous studies [20,24,28] on pear rootstocks using ML-based modeling showed that responses to the concentrations of macronutrients and PGRs differ among genotypes, as we found here in Persian walnut varieties. Because of these complex interactions, identifying the optimum levels of minerals and PGRs for a given plant genotype is complicated [58]. Furthermore, the incidence of physiological disorders such as Vit and STN throughout the proliferation phase of walnut calls for improved media for optimal explant growth. Optimized and effective media have previously been constructed for different plant species using reliable mathematical modeling and optimization methods [17, 20, 24-28, 59-61]. Here, we therefore suggest the use of ML-based modeling to identify concentrations of minerals and PGRs that maximize PR while minimizing CW, STN and Vit [24]. As we found here (Table 3), our previous results on pear [20,24] showed that ANN prediction models were more precise than MLR models and that MLR is not a trustworthy method for assessing nonlinear or nonpolynomial relationships among variables.
Our recent study on pear rootstock micropropagation [28] showed that multi-objective PSO was the most efficient method for optimizing GEP models. Therefore, we used the multi-objective PSO method here to optimize the selected GEP models. Our GEP-PSO optimized models provided complete optimized formulations for the proliferation of Chandler and Rayen (Table 3).
The mono-objective GA optimized MLPNN and RBFNN-based models obtained in our previous studies [20,24]  towards Mg²⁺, Cl⁻ and SO₄²⁻ for in vitro proliferation. Our recent study [28] on Pyrus rootstocks using mono-objective GA optimization of GEP models indicated that high PR may lead to low-quality plantlets. In agreement with this, our study [25] on G × N15 using mono-objective GA optimization of ANN models predicted that increasing the NH₄⁺ concentration enhances shoot number and length but gives more non-healthy shoots, whereas decreasing NH₄⁺ improves plantlet quality. Our results [28] on pear rootstock using RBFNN and GEP modeling procedures also indicated that a lower nitrogen content results in higher-quality plantlets. The interaction of NH₄⁺, NO₃⁻ and K⁺ has been the main subject of most in vitro studies [62], but using ML models, Nezami-Alanagh et al. [63] reported an interaction of K⁺, EDTA⁻ and SO₄²⁻ with a critical effect of K⁺ on PR of pistachio, in which low and high concentrations of K⁺ resulted in the highest and lowest PR, respectively. A study on Prunus sp. also showed that K⁺ at low concentration promotes PR [64]. Nezami-Alanagh et al. [63] concluded that either too low or too high amounts of K⁺, EDTA⁻ and SO₄²⁻ ions result in low-quality plantlets. Considering macro- and microelements, our multi-objective PSO optimized GEP models in Chandler showed that increasing NH₄⁺, NO₃⁻ and SO₄²⁻ increased PR and Vit while decreasing CW and STN. In Rayen, however, increasing SO₄²⁻ (except as K₂SO₄) and NO₃⁻ (except as KNO₃) increased PR and CW while decreasing STN and Vit (Table 3).
Reed et al. [65] emphasized optimizing the content of nitrogen components in the culture media to stimulate a high number of elongated shoots and reduce the amount of callus in different pear species. Nezami-Alanagh et al. [66] suggested avoiding a high content of NH₄⁺ to reduce callus formation in in vitro pistachio shoots. Low amounts of some MS medium components, such as KNO₃, MgSO₄, KH₂PO₄, CaCl₂ and NH₄NO₃, have been reported to promote STN in some Pyrus species [67]. In contrast, in our results, lower concentrations of K₂SO₄, MgSO₄, MnSO₄ and CuSO₄ in Chandler and of K₂SO₄ and KNO₃ in Rayen reduced the occurrence of STN. The results of [63] using neurofuzzy logic showed that a low amount of K⁺ and mid to high concentrations of SO₄²⁻ inhibit STN in pistachio explants, with fewer symptoms in UCB1 than in Ghazvini, which points to genotype differences like those found in our study. The ion confounding problem again prevents the exact relationship between a given mineral and the physiological disorder from being determined.
The neurofuzzy logic procedure showed a linear positive effect of nicotinic acid and pyridoxine-HCl on shoot multiplication parameters in pistachio [68], but, to our knowledge, there is no study on the impact of vitamins on the proliferation of walnut. Nezami-Alanagh et al. [66] showed that glycine and thiamine-HCl had different effects on some in vitro disorders of pistachio, and that increasing the glycine content strongly reduced callus development. Our study showed that a higher vitamin content reduced CW in Chandler, whereas a reduced vitamin content in Rayen led to higher CW (Table 3). Rayen was more recalcitrant to micropropagation than Chandler; hence, the higher PR and lower incidence of STN and Vit achieved can compensate for the small increase in CW.
Genotype is an important factor influencing the occurrence of physiological disorders in walnut, which agrees with the reports of [63,66] on pistachio. Similarly, other studies on pear [20,24,28,67] reported that the incidence of in vitro physiological disorders caused by unbalanced mineral nutrition differed among genotypes.
The purpose of our study was to present a highly accurate ML approach for predicting optimized culture media. We applied the MLPNN, KNN and GEP techniques, combined with PSO, to walnut proliferation data sets to achieve the most appropriate proliferation results. Comparison of our results with previous ones [20, 24-28, 63, 64, 66] indicates that using at least two methods together gives more precise results. Comparing the results of the methods used showed which media components enhanced or reduced the measured parameters (Table 3). The efficiency of the developed optimized media was compared with that of DKW. The media constituents proposed by our PSO optimized GEP models for Chandler showed that decreasing K₂SO₄, MgSO₄, MnSO₄, CuSO₄ and BAP while increasing the other nutrients, PGRs and vitamins increased PR as well as Vit while reducing CW and STN. The result was slightly different for Rayen, where decreasing K₂SO₄, KNO₃, vitamins and BAP while increasing the remaining nutrients and PGRs gave higher PR and CW but lower STN and Vit (Table 3). The use of macro- and micronutrient salts as factors in many micropropagation studies [20,24,28,63,68] gives rise to the ion confounding problem, which makes it difficult to identify precisely which ion(s) affect the studied parameter [69]. Our results, in comparison with previous studies on walnut [11, 15, 37-39] that addressed the effects of minerals and/or PGRs, showed for the first time that the effects of minerals depend not only on the PGR concentrations used but also on the vitamin concentration, which affects the explant response. Interactions among PGRs further complicate the regulation of plant growth processes. Cytokinin controls cell proliferation [70], and auxin increases the sensitivity of the less mitotically active cells of the apical meristem to cytokinin [71]. The cytokinin to auxin ratio is a key signal that controls phenotype [72].
Auxin and cytokinin have roles in DNA replication and cell cycle regulation, respectively [73]. The effects of PGRs may vary with plant species. The results of Ref. [26] on a Prunus rootstock indicated that applying cytokinin and auxin together gives a higher PR than applying either alone, and that PGR concentration and interaction are also important. Based on these results and the findings of [74] and [75], we used various concentrations of BAP, TDZ and IBA in our experiments. Our contrasting results can be attributed to the interaction of genotype and culture medium constituents [76] with PGRs [20]. The type and concentration of cytokinin strongly affected in vitro growth and survival of black walnut [39]. Ref. [37] reported that lower concentrations of zeatin were better than BAP for fast shoot elongation of black walnut nodal explants, while higher levels of zeatin and BAP led to shoot necrosis. Using TDZ at 0.01-0.02 mg/l in the medium increased the rate of morphological disorders [37]. In contrast, the higher levels of TDZ in our present study (1.30 and 0.52 mg/l in Chandler and Rayen, respectively) reduced STN in both Chandler and Rayen. Juglans regia has been successfully micropropagated using 0.1-2.01 mg/l BAP [4, 8, 12, 77-81]. The BAP concentrations we used (0.67 and 0.99 mg/l for Chandler and Rayen, respectively) also fall within this range. There are no results in the literature on the effect of BAP on the incidence of in vitro physiological disorders in walnut. However, according to in vitro studies on other plant species such as pistachio [64,82,83], adding an adequate amount of BAP strongly decreases the incidence of STN.
Therefore, in the present study, we evaluated the interaction of cytokinin and auxin PGRs with medium components, including nutrients and vitamins, on the proliferation of walnut in order to obtain the most efficient protocol with a reasonable range of PGRs. Our analyses showed that the PSO optimized GEP modeling technique is an efficient procedure for evaluating the interaction of different factors on walnut explant growth indices in the proliferation phase. GEP is therefore introduced here, for the first time, as a valuable tool for developing higher-quality and more efficient walnut tissue culture protocols in less time. Callus development during explant proliferation is a common problem in walnut micropropagation; here it was reduced as PR increased in Chandler but increased as PR increased in Rayen (Table 3). Yegizbayeva et al. [15] reported that callus formation is not correlated with PR in walnut. Callus formation has been attributed to certain concentrations of different mineral nutrients in various plant species, such as KH₂PO₄, CaCl₂ and MgSO₄ in some Prunus cultivars [67], NO₃⁻ in Rubus germplasm [84] and MgSO₄ in Prunus armeniaca [85]. Akin et al. [86], using CHAID analysis, reported NH₄⁺, followed by genotype and SO₄²⁻, as significant factors affecting callus formation during in vitro proliferation of hazelnut. Nezami-Alanagh et al. [63], using neurofuzzy logic, predicted that high concentrations of Fe²⁺ and low concentrations of SO₄²⁻ result in the lowest callus formation in pistachio rootstock explants. They suggested that a lower concentration of SO₄²⁻ in MS reduces shoot tip necrosis and callus development during in vitro proliferation of pistachio. Our results, however, showed that lower concentrations of both FeEDDHA and SO₄²⁻-containing minerals in DKW led to lower CW in both Chandler and Rayen. Bosela et al. [37] showed that the high-salt media, i.e. DKW and MS, resulted in lower Vit than WPM and 1/2X DKW media in walnut.

Conclusions
Walnut micropropagation is a problematic process with many in vitro drawbacks, including necrosis, callusing and vitrification. The present study demonstrated the efficiency of predictive models of plant in vitro proliferation built with advanced ML modeling procedures. To this end, one regression model (MLR) and three advanced ML models (MLPNN, KNN and GEP) were constructed to predict walnut in vitro PR and the associated physiological disorders as affected by culture medium constituents and genotype. Based on the results, the following conclusions and suggestions are presented:
• Advanced computational models are highly precise approaches that can be applied to control and predict the in vitro performance of walnut explants. They can also serve as an alternative to linear regression and conventional statistical analysis methods, with noteworthy performance. The KNN model has been used for the first time in this study for predicting plant in vitro performance. The optimized models should be applied to predict walnut PR in experimental designs for controlling undesirable physiological disorders.
• All ML models performed accurately in forecasting PR, CW, STN and Vit. Nevertheless, the accuracy of the GEP models was mostly higher than that of the MLPNN and KNN models. The GEP models were therefore selected for optimization by the PSO technique in order to achieve optimal culture media.
• Using the above-mentioned ML models is extremely useful for reducing the time and cost of formulating efficient walnut tissue culture media.
• The ML-designed media for walnut can not only raise PR (especially in Chandler) but also simultaneously reduce CW, STN and Vit.
• Genotype is a very important factor affecting in vitro performance. Based on our results, Rayen, an unbred genotype, appears to be more recalcitrant to in vitro propagation than the bred cultivar Chandler.
• Other factors such as sucrose, along with the medium components studied here and their interactions, also need to be incorporated into the predictive model to account comprehensively for PR and the occurrence of physiological disorders.

Table 3 Multi-objective PSO optimization of GEP models to achieve the highest quantity and quality through in vitro proliferation of walnut

MLR, MLPNN, KNN and GEP modeling techniques were applied to build models using various arrangements of minerals, vitamins and PGRs at different concentrations as inputs and different proliferation indices as outputs. The selected models were then optimized using PSO. Two case studies were conducted using the walnut cultivar Chandler and the genotype Rayen; the procedures used to determine the optimized input combinations are detailed below.

Case studies
In vitro established nodal cultures of Chandler and Rayen were sub-cultured in modified DKW media supplemented with various concentrations of auxin and cytokinin PGRs, 30 g/l sucrose and 3 g/l Gelrite. After adjusting the pH to 5.5, the media were dispensed into jam jars (250 ml) with polyethylene caps. The dispensed media were then autoclaved for 15 min at 1 kg cm−2 s−1 (121 °C). The cultures were kept under a 16-h photoperiod of white fluorescent light (80 µmol m−2 s−1) at 25 ± 2 °C for 30 days. Subsequently, PR, CW, STN and Vit were measured. In each experiment set, every treatment included 8 replicates (jam jars) for both Chandler and Rayen.

Taguchi experimental design for optimization of explant proliferation
Taguchi design is a robust and effective tool for process optimization that performs consistently and optimally across different conditions. Taguchi designs, i.e. orthogonal arrays, make it possible to evaluate numerous factors with a limited number of runs. In this design, the factors are weighted equally within an experiment, so each factor can be analyzed independently of the others. Deviation of a product's characteristics from their target values is caused by noise factors such as human error. Based on Taguchi's orthogonal arrays, a standard L27 (3⁵) orthogonal array, comprising 27 experiments with 26 degrees of freedom, was applied to each of Chandler and Rayen to evaluate the effect of nine factors, according to Table 4, on PR, CW, STN and Vit. For each experiment, the three levels of each factor were based on various coefficients × DKW basal medium nutrients and different PGR concentrations (Table 5). Every nutrient and PGR concentration treatment included at least 8 replicates. Of the 224 experimental sets, 157 (70% of data lines) were randomly chosen for training the models, and the remaining 67 (30% of data lines) were used to test the models' generalization capacity. In all ML models, k-fold (k = 10) cross-validation [87,88] was used during training to maintain and guarantee the generalizability of the constructed models.
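The data partitioning described above can be illustrated with a minimal sketch. It assumes the 224 data lines are simply indexed 0 to 223; the function names, the fixed random seed and the way folds are formed are illustrative assumptions, not the authors' actual procedure.

```python
import random

def train_test_split_indices(n_samples, n_train, seed=0):
    """Randomly partition range(n_samples) into train and test index lists."""
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return sorted(indices[:n_train]), sorted(indices[n_train:])

def k_fold_indices(indices, k=10, seed=0):
    """Yield (fit, validate) index lists for k-fold cross-validation."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k folds of near-equal size
    for i in range(k):
        validate = folds[i]
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validate

# 224 data lines: 157 for training (about 70%), 67 held out for testing
train_idx, test_idx = train_test_split_indices(224, 157)
folds = list(k_fold_indices(train_idx, k=10))
print(len(train_idx), len(test_idx), len(folds))
```

In this sketch the held-out 67 lines are never seen during cross-validation, so they remain available for an independent check of generalization.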

Multiple linear regression
MLR analysis is a multivariate statistical method for assessing the relationship between multiple independent variables and a single dependent variable. The two main purposes of MLR are prediction and explanation. Prediction concerns the extent to which the independent variables can predict the dependent variable. Explanation concerns the regression coefficients, their sign, magnitude and statistical significance, for each independent variable [89]. Linear regression is regarded as the foundational regression method and serves as a benchmark against which newer methods are compared. As with other regression methods, MLR describes the relationship between a dependent variable and multiple independent variables.

Table 4 The components and levels of factors used for walnut micropropagation and the mean values of the measured traits used to characterize it
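For reference, the general linear form underlying MLR can be written as follows. This is the textbook form, given here as an illustration; the symbols β₀, βⱼ and ε (intercept, regression coefficients and error term) are notation assumed for this sketch, with the nine inputs X1 to X9 and a single output Y taken from the model description in this paper.

```latex
Y = \beta_0 + \sum_{j=1}^{9} \beta_j X_j + \varepsilon
```

Each coefficient βⱼ gives the expected change in Y per unit change in Xⱼ with the other inputs held constant, which is what makes the sign and magnitude of the coefficients interpretable.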

Multilayer perceptron neural network
Neural networks are divided into various types according to their transfer functions. In the present study, we used a multi-layer perceptron neural network (MLPNN). The MLPNN model is the most common and widely used type of artificial neural network [90]. It generally contains an input layer and an output layer, with one or more hidden layers placed between them. Each neuron in this structure has a number of inputs and a number of outputs. A neuron calculates its output from the weighted sum of all its inputs, which is passed through an activation (transfer) function. In the MLPNN model, information flows in only one direction: from the input layer (independent variables), through the hidden layer(s), to the output layer (dependent variable). Training an MLPNN involves adjusting the connection weights between neurons using one of several network training methods [91]. In this study, the Broyden-Fletcher-Goldfarb-Shanno (BFGS) training algorithm was used. The hyperbolic tangent (Tanh), sigmoid (Logs), exponential (Exp) and ReLU (Relu) activation functions in the hidden layer, with a linear (identity) function (Idn) in the output layer, were compared and evaluated, and the best function was selected. The number of hidden layers was determined by trial and error so as to reach the minimum error rate. See [91,92] for more information.
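The forward flow of information described above can be sketched as follows. This is a minimal illustration assuming a single hidden layer with Tanh activation and an identity output; the function name, weights and biases are hypothetical, and training (e.g. by BFGS) is not shown.

```python
import math

def mlp_forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass: one Tanh hidden layer followed by a linear (identity) output."""
    # Each hidden neuron applies tanh to the weighted sum of its inputs plus a bias
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(weights, x)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    # The output neuron is a plain weighted sum (identity activation)
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias

# Hypothetical network: 2 inputs, 2 hidden neurons, 1 output
prediction = mlp_forward(
    x=[0.2, 0.7],
    hidden_weights=[[0.5, -0.3], [0.1, 0.8]],
    hidden_biases=[0.0, 0.1],
    output_weights=[1.2, -0.4],
    output_bias=0.05,
)
print(prediction)
```

Swapping `math.tanh` for another function is all that is needed to compare hidden-layer activations in a sketch like this one.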

k-nearest neighbors' algorithm
The k-nearest neighbors (KNN) model is a supervised, instance-based machine learning algorithm used for classification and regression. In this model, each data point is a position in a vector space, and points that belong together are expected to have similar properties and to lie close to one another. Two choices are of particular importance in the KNN algorithm: the number of neighbors (k) and the method used to calculate the distance between points. If k is too small, the prediction rests on very few neighbors and becomes sensitive to unrepresentative points, which reduces the accuracy of the results. If k is too large, distinct groups may be merged and the computational load increases [93]. Different values of k were evaluated, and the value giving the highest model accuracy was selected. Distances between neighboring points were determined using various geometric methods. In this study, the Euclidean, Chebyshev, Manhattan and Minkowski distances were examined and the best method was selected.
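A minimal sketch of KNN used for regression with interchangeable distance functions is given below. It assumes the prediction is the unweighted mean of the k nearest training responses; the function names and the toy data are hypothetical and are not taken from the study.

```python
def minkowski(a, b, p):
    """Minkowski distance; p=1 gives Manhattan and p=2 gives Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def chebyshev(a, b):
    """Chebyshev distance: the largest coordinate-wise difference."""
    return max(abs(x - y) for x, y in zip(a, b))

def knn_predict(train_x, train_y, query, k=3, distance=None):
    """Predict the response at `query` as the mean of its k nearest neighbors."""
    if distance is None:
        distance = lambda a, b: minkowski(a, b, 2)  # Euclidean by default
    ranked = sorted(zip(train_x, train_y), key=lambda pair: distance(pair[0], query))
    nearest = ranked[:k]
    return sum(y for _, y in nearest) / len(nearest)

# Toy data (hypothetical): one input variable, one response
train_x = [[0.0], [1.0], [2.0], [10.0]]
train_y = [0.0, 1.0, 2.0, 10.0]
print(knn_predict(train_x, train_y, [1.0], k=3))
print(knn_predict(train_x, train_y, [1.0], k=3, distance=chebyshev))
```

Selecting k and the distance function then amounts to repeating the prediction over candidate settings and keeping the one with the lowest validation error.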

Gene expression programming
Genetic programming (GP) is a modeling approach that has been used to model complex behavior in structural engineering problems. It is an extension of the genetic algorithm (GA) that searches a program space rather than a data space. An important benefit of GP-based techniques over other methods is their ability to produce prediction equations without assuming any prior form for the relationship. Many researchers have applied GP and GP-based methods to find complicated relationships fitting different experimental data [44,94,95]. GEP has been introduced as an effective alternative to conventional GP [31,46]. GEP evolves computer programs encoded in linear chromosomes of constant length, each of which includes several genes [31,96]. GEP originates from evolutionary algorithms such as GA and GP. In this technique, a population of individuals is created, and a fitness function and genetic variation are then used to select better individuals. The genetic variation is introduced by genetic operators. GEP is a machine learning technique that learns the relationships between variables in datasets. The way individuals are represented differs between GEP and its predecessors GP and GA: GEP encodes individuals as linear strings (chromosomes) of fixed length, which are finally expressed as expression trees, i.e. simple diagrams. GA and GP, in contrast, express individuals as linear strings (chromosomes) of fixed length and as nonlinear entities of different sizes and shapes (parse trees), respectively. One strength of GEP over GP and GA is that genetic operators work very easily at the chromosome level, which readily creates genetic diversity. Another strength of GEP is its unique multigenic nature, which allows more complicated programs with numerous subprograms to be developed. GEP thus combines the advantages of both GP and GA while overcoming some of their limitations [57,97,98].

Fig. 10 Diagram of gene expression programming as a prediction model
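The mapping from a fixed-length linear chromosome to an expression tree can be sketched as follows. The sketch assumes the breadth-first gene encoding commonly described for GEP (often called Karva notation) and only binary arithmetic functions; the gene string and function names are hypothetical and do not reproduce the settings used in GeneXpro.

```python
import operator

# Function set: binary arithmetic operators (an assumption for this sketch)
FUNCTIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def decode_gene(gene):
    """Build an expression tree from a gene read breadth-first, level by level."""
    root = [gene[0], []]  # each node is [symbol, children]
    level = [root]
    position = 1
    while level:
        next_level = []
        for node in level:
            arity = 2 if node[0] in FUNCTIONS else 0  # terminals take no arguments
            for _ in range(arity):
                child = [gene[position], []]
                position += 1
                node[1].append(child)
                next_level.append(child)
        level = next_level
    return root  # symbols after `position` are the unexpressed tail of the gene

def to_infix(node):
    """Render the expression tree as a parenthesized formula."""
    symbol, children = node
    if not children:
        return symbol
    return "(" + to_infix(children[0]) + " " + symbol + " " + to_infix(children[1]) + ")"

def evaluate(node, values):
    """Evaluate the expression tree for given terminal values."""
    symbol, children = node
    if symbol in FUNCTIONS:
        left, right = (evaluate(child, values) for child in children)
        return FUNCTIONS[symbol](left, right)
    return values[symbol]

tree = decode_gene("*+-abcdab")  # the trailing "ab" is never expressed
print(to_infix(tree))
print(evaluate(tree, {"a": 1, "b": 2, "c": 5, "d": 3}))
```

Because the chromosome stays a flat string, operators such as mutation can change any symbol and the result still decodes to a valid tree, provided the tail of the gene contains only terminals.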
The phenotype of a GEP chromosome is illustrated in Fig. 10, and the genotype can be readily derived from the phenotype, as represented in Eq. (1):

$$(a + b) * (c - d) \tag{1}$$

The functional steps of GEP are represented in Fig. 10 [31]. According to this diagram, GEP starts with a population of chromosomes. The genes of the chromosomes are then expressed, and the fitness of each individual is evaluated. The individuals are then selected according to their fitness to reproduce with modification. The same process is run on the new generation of individuals. Overall, this procedure is repeated for a given number of generations or until a termination condition is reached. The GEP system employs roulette wheel sampling with elitism to ensure that the top individuals, according to fitness, are retained and copied to the next generation. Genetic operators, comprising mutation, crossover and rotation, are then applied to the chosen chromosomes, introducing diversity into the population.
The GeneXpro software package was used to build the GEP models. The parameters employed in the GEP models are presented in Table 6.
In this study, the choice of functions and mathematical operators is reasoned but not fixed, so the modeler is free to select functions according to the nature of the problem under study. The functions and operators were selected with a view to keeping the developed model simple and ensuring faster convergence. The population size (number of chromosomes) sets the number of programs in the population. The larger the population, the longer each iteration takes. Large numbers of chromosomes were tested to obtain models with minimum error. Runs were continued until no significant improvement in model performance was observed. Here, we aimed to obtain an explicit relationship between the decision variables and the response variables. Explicit GEP formulations were obtained for PR, CW, STN and Vit as functions of the experimental parameters, i.e. Y1, Y2, Y3 and Y4 = f (X1, X2, X3, X4, X5, X6, X7, X8 and X9) (Table 1). Input data were normalized to the range of 0 to 1 according to Eq. 2:

$$X_n = \frac{X_i - X_{min}}{X_{max} - X_{min}} \tag{2}$$

where X_n is the normalized dimensionless data, X_i is the observed data, X_min is the minimum of the observed data, and X_max is its maximum.
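The min-max normalization of the input data to the range of 0 to 1 can be sketched as below. The function name and the example values are hypothetical, and the handling of a constant column (mapping it to 0) is an assumption, since the paper does not describe that case.

```python
def min_max_normalize(values):
    """Rescale values to [0, 1] as (X_i - X_min) / (X_max - X_min)."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Assumption: a constant column carries no information and maps to 0
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]

# Hypothetical three-level factor
print(min_max_normalize([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
```

Normalizing each input separately keeps factors measured on very different scales from dominating distance-based or gradient-based models.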

Comparison of the performance of developed models
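To evaluate the precision of the developed models, we used several statistical indices, namely the coefficient of determination (R²), root mean square error (RMSE) and mean absolute error (MAE), based on Eqs. 5, 6 and 7:

$$R^2 = \frac{\left[\sum_{i=1}^{N} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})\right]^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2 \sum_{i=1}^{N} (\hat{y}_i - \bar{\hat{y}})^2} \tag{5}$$

$$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2} \tag{6}$$

$$MAE = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right| \tag{7}$$

where y_i and ȳ are the observed values and their mean, and ŷ_i and the mean of ŷ are the predicted values and their mean, respectively, for N samples. The parameters were analyzed together to achieve an accurate medium composition.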
In addition, the predicted values by the developed models were plotted against the corresponding observed values to evaluate the ability of models for prediction.
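The three indices can be computed as in the sketch below. It assumes R² is taken as the squared correlation between observed and predicted values, which is consistent with the definitions above referring to the means of both series; the function names and example values are hypothetical.

```python
import math

def rmse(observed, predicted):
    """Root mean square error."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def mae(observed, predicted):
    """Mean absolute error."""
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n

def r_squared(observed, predicted):
    """Squared correlation between observed and predicted values (assumed form)."""
    mean_o = sum(observed) / len(observed)
    mean_p = sum(predicted) / len(predicted)
    covariance = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    spread_o = sum((o - mean_o) ** 2 for o in observed)
    spread_p = sum((p - mean_p) ** 2 for p in predicted)
    return covariance ** 2 / (spread_o * spread_p)

# Hypothetical values: predictions are offset from observations by a constant 1
observed = [1.0, 2.0, 3.0]
predicted = [2.0, 3.0, 4.0]
print(r_squared(observed, predicted), rmse(observed, predicted), mae(observed, predicted))
```

In this example R² equals 1 while RMSE and MAE both equal 1, which shows why the indices are read together: a correlation-based R² does not detect a constant bias, whereas the error measures do.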

Particle swarm optimization of GEP models
PSO is a population-based evolutionary computation and swarm intelligence algorithm for solving general optimization problems, developed by [99]. It is a mathematical computation method that starts with a swarm (a population of particles) and is largely based on social models such as swarm theory, fish schooling and bird flocking [20]. PSO relies on swarm behavior, i.e. keeping optimum distances between individual members and their neighbors. To optimize the location of each particle, its position is adjusted within the search area according to the objective function. The key factor in PSO is therefore the velocity of a particle, which is updated from its previous value in each iteration to lead the particle toward its optimal position. The best solution (fitness) that each particle in the swarm has achieved so far is named pbest. The other "best" value tracked by the particle swarm optimizer is the best value attained so far by any particle in the population, i.e. the global best, named gbest. The velocity and position of each particle in the swarm are updated by Eqs. 3 and 4 [99]:

$$V_{i+1} = w V_i + c_1 r_1 (pbest - X_i) + c_2 r_2 (gbest - X_i) \tag{3}$$

$$X_{i+1} = X_i + V_{i+1} \tag{4}$$

in which V_{i+1} is the new velocity of each particle based on its prior velocity (V_i), w is the inertial coefficient (0.8-1.2), c_1 and c_2 are the cognitive and social coefficients, respectively (0-2), r_1 and r_2 are random values drawn for each velocity update (0-1), and X_{i+1} is the new location of each particle based on its prior location (X_i).
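The velocity and position updates can be illustrated with a minimal single-objective sketch. Note the assumptions: the study applies a multi-objective PSO to the GEP models, whereas this sketch minimizes one simple test function; the function name, the chosen coefficient values (taken from within the ranges given above), the bound clamping and the fixed seed are all illustrative choices rather than the authors' settings.

```python
import random

def pso_minimize(objective, bounds, n_particles=20, n_iterations=100,
                 w=0.8, c1=1.2, c2=1.2, seed=0):
    """Minimize `objective` over box `bounds` with a basic particle swarm."""
    rng = random.Random(seed)
    dim = len(bounds)
    positions = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    velocities = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in positions]                 # best position of each particle
    pbest_scores = [objective(p) for p in positions]
    best = min(range(n_particles), key=lambda i: pbest_scores[i])
    gbest, gbest_score = pbest[best][:], pbest_scores[best]  # best of the whole swarm

    for _ in range(n_iterations):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity update: inertia + cognitive pull + social pull
                velocities[i][d] = (w * velocities[i][d]
                                    + c1 * r1 * (pbest[i][d] - positions[i][d])
                                    + c2 * r2 * (gbest[d] - positions[i][d]))
                # Position update, clamped to the search bounds (an added assumption)
                lo, hi = bounds[d]
                positions[i][d] = min(max(positions[i][d] + velocities[i][d], lo), hi)
            score = objective(positions[i])
            if score < pbest_scores[i]:
                pbest[i], pbest_scores[i] = positions[i][:], score
                if score < gbest_score:
                    gbest, gbest_score = positions[i][:], score
    return gbest, gbest_score

# Test function (hypothetical): sum of squares, with its minimum of 0 at the origin
best_position, best_score = pso_minimize(lambda x: sum(v * v for v in x),
                                         bounds=[(-5.0, 5.0), (-5.0, 5.0)])
print(best_position, best_score)
```

To optimize a fitted model instead, `objective` would wrap the model's prediction for a candidate medium composition, negated if the response is to be maximized.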